Analyzing and Predicting Credit Card Customer Attrition

by Rohit Nutalapati

CMSC320, Spring 2021

Contents:

  1. Introduction
    • Objective
  2. Data Collection and Storage
  3. Data Representation
    • Target Variable
    • Missing Data
    • Unique Values
  4. Exploratory Data Analysis
    • Balance
    • Spread
    • Initial Exploration
    • Encoding Categorical Variables
    • Multicollinearity
  5. Machine Learning
    • Feature Engineering
    • Data Split and Scaling
    • Hyperparameter Optimization and Performance Metrics
    • Model 1: Logistic Regression
      • Training
      • Testing
    • Model 2: Random Forest
      • Training
      • Testing
    • Model 3: XGBoost
      • Training
      • Testing
  6. Explainability and Insight
    • Interpreting Lime Results
    • Insights Drawn
  7. Conclusion and Resources

1. Introduction

To the world of finance and banking, the future is everything. More specifically, analyzing the past to understand the future is everything. One of the biggest problems companies face is customer attrition. Rates, competition, dissatisfaction, and a whole array of other factors can drive customers to close their accounts with a bank. Credit card customer acquisition costs around $200 per user in the United States on average (source), and in some cases it can go much higher. Building a loyal customer base secures revenue, and being able to identify customers who are about to attrite lets a company strengthen its relationship with those customers and incentivize them to stay.

Objective

In this tutorial, I will aim to take the reader through the data science pipeline with a customer churn use case. We will explore the factors that result in credit card customer attrition through data analysis as well as machine learning, with a later inclusion of explainable ML.

Given a dataset with information from customers' bank and transaction records, we want to be able to build an ML model that does reasonably well at predicting if a customer is likely to attrite, and then we want to be able to find out why.

2. Data Collection and Storage

The dataset is a set of credit card customer records for an unnamed bank (as most are in publicly available financial datasets). The original source of this data is a website called LEAPS, which has a walkthrough of using Naive Bayes classification to solve this problem. However, we will be removing the Naive Bayes columns from the dataset, because the focus of this data science tutorial is on exploratory data analysis and machine learning.

The dataset can be found and downloaded on Kaggle, as well.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import random

I'm uploading the dataset, a .csv file stored on my local system, into this notebook.

In [2]:
from google.colab import files
my_file = files.upload()
Upload widget is only available when the cell has been executed in the current browser session. Please rerun this cell to enable.
Saving BankChurners.csv to BankChurners (1).csv

Let's take a look at the dataset by storing it in a Pandas DataFrame, df.

In [3]:
df = pd.read_csv("BankChurners.csv")
df.head()
Out[3]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1 Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061 0.000093 0.99991
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105 0.000057 0.99994
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000 0.000021 0.99998
3 769911858 Existing Customer 40 F 4 High School Unknown Less than $40K Blue 34 3 4 1 3313.0 2517 796.0 1.405 1171 20 2.333 0.760 0.000134 0.99987
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.0 0 4716.0 2.175 816 28 2.500 0.000 0.000022 0.99998

Right off the bat, we notice three columns (CLIENTNUM and the two columns with Naive Bayes results) that will not serve any purpose in this tutorial. Keeping them around will simply induce headaches until the Feature Engineering step in 5. Machine Learning, so we can remove these three columns straight away.

In [4]:
df.drop(columns=["CLIENTNUM", "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1","Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2"], inplace=True)
df.head()
Out[4]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
3 Existing Customer 40 F 4 High School Unknown Less than $40K Blue 34 3 4 1 3313.0 2517 796.0 1.405 1171 20 2.333 0.760
4 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.0 0 4716.0 2.175 816 28 2.500 0.000

3. Data Representation

Let's see what Pandas can reveal to us about the size and the data types of the entries in each column, and then cross-reference that with the descriptions from the original source to identify which variables are numerical ones and which are categorical ones, and then list out what they each mean.

In [5]:
df.shape
Out[5]:
(10127, 20)
In [6]:
df.dtypes
Out[6]:
Attrition_Flag               object
Customer_Age                  int64
Gender                       object
Dependent_count               int64
Education_Level              object
Marital_Status               object
Income_Category              object
Card_Category                object
Months_on_book                int64
Total_Relationship_Count      int64
Months_Inactive_12_mon        int64
Contacts_Count_12_mon         int64
Credit_Limit                float64
Total_Revolving_Bal           int64
Avg_Open_To_Buy             float64
Total_Amt_Chng_Q4_Q1        float64
Total_Trans_Amt               int64
Total_Trans_Ct                int64
Total_Ct_Chng_Q4_Q1         float64
Avg_Utilization_Ratio       float64
dtype: object

Now, below is a breakdown of the numerical and the categorical variables along with corresponding descriptions as listed on the original source, LEAPS.

Numerical Variables:

  • Customer_Age: demographic variable - customer's age in years
  • Dependent_count: demographic variable - number of dependents
  • Months_on_book: months on book (time of relationship)
  • Total_Relationship_Count: total no. of products held by the customer
  • Months_Inactive_12_mon: no. of months inactive in the last 12 months
  • Contacts_Count_12_mon: no. of contacts in the last 12 months
  • Credit_Limit: credit limit on the credit card
  • Total_Revolving_Bal: total revolving balance on the credit card
  • Avg_Open_To_Buy: open to buy credit line (average of last 12 months)
  • Total_Amt_Chng_Q4_Q1: change in transaction amount (Q4 over Q1)
  • Total_Trans_Amt: total transaction amount (last 12 months)
  • Total_Trans_Ct: total transaction count (last 12 months)
  • Total_Ct_Chng_Q4_Q1: change in transaction count (Q4 over Q1)
  • Avg_Utilization_Ratio: average card utilization ratio

Categorical Variables:

  • Attrition_Flag: internal event (customer activity) variable - whether the account has been closed ("Attrited Customer") or remains open ("Existing Customer")
  • Gender: demographic variable - M, F
  • Education_Level: demographic variable - educational qualification of the account holder
  • Marital_Status: demographic variable - Married, Single, Unknown
  • Income_Category: demographic variable - annual income category of the account holder
  • Card_Category: product variable - type of card (Blue, Silver, Gold, Platinum)

Target Variable

Our target variable here, the one that we will focus our analysis around, is Attrition_Flag, according to the goal set forth in 1. Introduction.

Missing Data

There's another problem that's immediately visible when we examine the output of df.head(), which is the presence of rows with "Unknown" values. This dataset does have some missing data. Since we do not have too much information about how the data was collected, attempting to fill in these values with mean values, mode values, or some other form of extrapolation might not present us with an accurate enough picture.

It is most likely best to remove rows with "Unknown" values from the dataset since the data we have will eventually be used in ML further below.

In [7]:
for i in df.columns:
  df = df[df[i] != "Unknown"]
df.shape
Out[7]:
(7081, 20)

If this hadn't left us with so much complete data, it would be worth digging deeper to find another way to combat the missing data. Fortunately, we only lost approximately 30% of the dataset to missing values, which still leaves us plenty to analyze.
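
As a quick sanity check on that 30% figure, here's a one-off calculation using the shapes reported by df.shape before and after the filtering (the numbers are hard-coded from the outputs above):

```python
# Fraction of rows dropped when removing "Unknown" values,
# taken from df.shape before (10127, 20) and after (7081, 20).
rows_before, rows_after = 10127, 7081
fraction_lost = (rows_before - rows_after) / rows_before
print(f"{fraction_lost:.1%}")  # → 30.1%
```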

Unique Values

With the missing data gone, let's now see how many unique values each feature holds.

In [8]:
df.nunique().to_frame("Unique_Values").sort_values("Unique_Values")
Out[8]:
Unique_Values
Attrition_Flag 2
Gender 2
Marital_Status 3
Card_Category 4
Income_Category 5
Total_Relationship_Count 6
Dependent_count 6
Education_Level 6
Months_Inactive_12_mon 7
Contacts_Count_12_mon 7
Months_on_book 44
Customer_Age 45
Total_Trans_Ct 124
Total_Ct_Chng_Q4_Q1 771
Avg_Utilization_Ratio 946
Total_Amt_Chng_Q4_Q1 1067
Total_Revolving_Bal 1821
Total_Trans_Amt 4194
Credit_Limit 4654
Avg_Open_To_Buy 5144

It seems like all of the categorical variables we identified in organizing our initial list have low counts of unique values, which is great since it would be hard to make sense of categorical variables that could take on a wide range of values - they may as well be continuous beyond a point. We can also rest assured that our focus, Attrition_Flag, only has two unique values. Phew!

4. Exploratory Data Analysis

Now that we've been able to organize our understanding of the features we have in the dataset, let's look at some key relationships between the variables that can help inform our judgement as to what affects credit card customer attrition.

Balance

A good starting point would be to check the balance of the dataset and see what proportion of our data represents customers that actually attrited.

In [9]:
df.Attrition_Flag.value_counts()
Out[9]:
Existing Customer    5968
Attrited Customer    1113
Name: Attrition_Flag, dtype: int64

This is a very imbalanced dataset, with only about 16% who attrited!
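
That percentage comes straight from the value counts above; a minimal sketch of the arithmetic (counts hard-coded from the output):

```python
# Class balance as a proportion, using the value_counts() output above.
counts = {"Existing Customer": 5968, "Attrited Customer": 1113}
attrition_rate = counts["Attrited Customer"] / sum(counts.values())
print(f"{attrition_rate:.1%}")  # → 15.7%
```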

Spread

Let's get some preliminary information about each of the numerical columns in our dataset, and see what we can learn from that.

In [10]:
df.describe().apply(lambda s: s.apply('{0:.3f}'.format))
Out[10]:
Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
count 7081.000 7081.000 7081.000 7081.000 7081.000 7081.000 7081.000 7081.000 7081.000 7081.000 7081.000 7081.000 7081.000 7081.000
mean 46.348 2.338 35.981 3.819 2.343 2.454 8492.774 1167.502 7325.272 0.761 4394.300 64.503 0.712 0.282
std 8.041 1.292 8.003 1.544 0.995 1.105 9126.073 812.316 9131.218 0.223 3468.462 23.809 0.239 0.279
min 26.000 0.000 13.000 1.000 0.000 0.000 1438.300 0.000 3.000 0.000 510.000 10.000 0.000 0.000
25% 41.000 1.000 31.000 3.000 2.000 2.000 2498.000 463.000 1248.000 0.629 2089.000 44.000 0.583 0.026
50% 46.000 2.000 36.000 4.000 2.000 2.000 4287.000 1282.000 3250.000 0.735 3831.000 67.000 0.700 0.186
75% 52.000 3.000 40.000 5.000 3.000 3.000 10729.000 1781.000 9491.000 0.858 4740.000 80.000 0.818 0.515
max 73.000 5.000 56.000 6.000 6.000 6.000 34516.000 2517.000 34516.000 3.397 17995.000 134.000 3.714 0.999

Let's make note of some major observations and potential explanations:

  • Customer_Age is centered around 46, not dropping below 26 and not exceeding 73. This could indicate that this bank doesn't focus on the lower-age demographic through means like e-banking or a smartphone app (even though those may be options), and there don't seem to be any accounts listed for minors.
  • Dependent_count shows us most customers have at least two dependents, spanning all the way up to five, which seems fitting given the spread of the customers' ages.
  • Months_on_book seems to range from accounts that have been open for just over a year to accounts open for roughly four and a half years. A little unexpected, given that the age range is on the higher side, but perhaps in the data we've been given the bank wanted to focus on the younger accounts, which have a naturally higher attrition propensity.
  • Months_Inactive_12_mon certainly seems like an important feature to retain as a telltale sign of a customer about to churn.
  • Contacts_Count_12_mon ranges from no contacts to six contacts in the last year. On the surface, this may seem like more contacts is a good thing, since it signals a lack of customer inactivity. However, that's not something to immediately conclude without knowing if the contacts were about, for example, a malfunctioning product.
  • Customers with a high Total_Revolving_Bal and Avg_Utilization_Ratio would probably be more likely to stay.

Initial Exploration

Let's verify if that last point on the list above is true by examining violinplots to show us the distributions.

In [11]:
plt.figure(figsize=(6, 8))
sns.violinplot(data=df, x="Attrition_Flag", y="Total_Revolving_Bal", palette="mako")
plt.title("Distributions of Total_Revolving_Bal Across Attrition")
Out[11]:
Text(0.5, 1.0, 'Distributions of Total_Revolving_Bal Across Attrition')
In [12]:
plt.figure(figsize=(6, 8))
sns.violinplot(data=df, x="Attrition_Flag", y="Avg_Utilization_Ratio", palette="mako")
plt.title("Distributions of Average Card Utilization Ratio Across Attrition")
Out[12]:
Text(0.5, 1.0, 'Distributions of Average Card Utilization Ratio Across Attrition')

It seems that these factors do make a noticeable difference, as expected - when these values are higher, the customer seems less likely to attrite. While we're at it, let's take a look at the difference between the number of products customers have held when they've stayed versus attrited.

In [13]:
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x="Total_Relationship_Count", hue="Attrition_Flag", palette="mako")
plt.title("Counts of Attrition Across Numbers of Products Held")
Out[13]:
Text(0.5, 1.0, 'Counts of Attrition Across Numbers of Products Held')

This is also a useful piece of information for us. Looking at the relative distributions, customers with more products with the bank are more likely to stay, and those with fewer products are more likely to attrite.

Intuitively, there should be a link between Customer_Age and Months_on_book, since older customers would tend to have longer open accounts. Let's examine a scatter plot between those two variables with a regression line.

In [14]:
plt.figure(figsize=(10,5))
sns.regplot(data=df, x="Customer_Age", y="Months_on_book")
plt.title("Trend Between Age and Months On Book")
Out[14]:
Text(0.5, 1.0, 'Trend Between Age and Months On Book')

That is a very strong correlation! It does make logical sense for this trend to exist, as previously noted, but this could pose problems when we search for collinearity in our data later on.

Another intriguing aspect of the shape of the scatter plot here is the values of Months_on_book centering heavily around a singular value between 30 and 40. Let's examine that distribution.

In [15]:
plt.figure(figsize=(10,5))
sns.countplot(data=df, x="Months_on_book", hue="Attrition_Flag", palette="mako")
plt.title("Count Distribution of Months On Book")
Out[15]:
Text(0.5, 1.0, 'Count Distribution of Months On Book')

That is such a heavy center of the distribution! While the overall spread looks symmetrical, it's strange that so many of our datapoints lie squarely at 36 months of having an open account with the bank. We don't have too much temporal context about the collection of this data to be able to guess why, but this variable probably would not be doing our ML model too many favors.

Does Customer_Age's effect on Attrition_Flag have any surprises in store for us?

In [16]:
plt.figure(figsize=(10,5))
sns.violinplot(data=df, x="Attrition_Flag", y="Customer_Age", palette="mako")
plt.title("Distributions of Age Across Attrition")
Out[16]:
Text(0.5, 1.0, 'Distributions of Age Across Attrition')

There's barely any difference between the age distributions of customers who attrited and customers who stayed with the bank. At least, that's what the data we have tells us. It doesn't seem like there's a major discernible pattern here, save for a few outliers in the existing customers' distribution.

Another factor that we'd think would affect Attrition_Flag is Income_Category - wouldn't higher income people be less likely to attrite? Examining that distribution could tell us if our little informal hypothesis is accurate.

In [17]:
plt.figure(figsize=(10,5))
sns.countplot(x="Income_Category", data=df, hue="Attrition_Flag", palette="mako")
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f898707c810>

Just by eyeballing it, the income category distributions between the existing and the attrited customers don't look all that different. Maybe it doesn't affect Attrition_Flag as much as we thought. A more rigorous analysis further below can confirm that.

We've been able to identify some high-level information, examine a couple preliminary distributions, and make some broad observations. However, this is only giving us part of the picture. We need to look at all the features we have at hand and see what we can analyze in terms of both interdependence of the feature set as well as the target variable, Attrition_Flag.

Encoding Categorical Variables

In order for us to assess relationships between our variables further, it would be best to convert the categorical values we have into a numerical form as well, for better comparison. Not to mention, ML models can't digest variables unless they're turned into numbers!

Unfortunately, an ordinal encoding may (by definition) introduce some level of unintended ordering in those variables, but the other option is to convert those values into dummy columns (one-hot-encoded), which would be harder to search for correlations in. We'll go ahead and encode them ordinally.

Note: Attrition_Flag will be encoded such that 0 represents a customer who stayed and 1 represents a customer who attrited, since the problem is focused towards the event of attrition.

In [18]:
categories_encoded = {
                "Attrition_Flag": {"Existing Customer": 0, "Attrited Customer": 1},
                "Gender": {"M": 1, "F": 2}, 
                "Education_Level": {"Uneducated": 1, "High School": 2, "College": 3, "Graduate": 4, "Post-Graduate": 5, "Doctorate": 6}, 
                "Marital_Status": {"Single": 1, "Divorced": 2, "Married": 3}, 
                "Income_Category": {"Less than $40K": 1, "$40K - $60K": 2, "$60K - $80K": 3, "$80K - $120K": 4, "$120K +": 5}, 
                "Card_Category": {"Blue": 1, "Silver": 2, "Gold": 3, "Platinum": 4}
                }

df.replace(categories_encoded, inplace=True)
df.head()
Out[18]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 0 45 1 3 2 3 3 1 39 5 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 0 49 2 5 4 1 1 1 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 0 51 1 3 4 3 4 1 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
4 0 40 1 3 1 3 3 1 21 5 1 0 4716.0 0 4716.0 2.175 816 28 2.500 0.000
5 0 44 1 2 4 3 2 1 36 3 1 2 4010.0 1247 2763.0 1.376 1088 24 0.846 0.311

Fortunately, most of those variables work fine with an ordered encoding! That means that there is a broad level of natural ordering that can be provided to them that makes sense, such as increasing Education_Level and increasing Income_Category.

The only one that really doesn't have an order to it is Gender, but we simply have to pick an arbitrary order to continue with analysis. Let's just confirm that we're left with only numerical values in each column before moving on to more advanced EDA.

In [19]:
df.dtypes
Out[19]:
Attrition_Flag                int64
Customer_Age                  int64
Gender                        int64
Dependent_count               int64
Education_Level               int64
Marital_Status                int64
Income_Category               int64
Card_Category                 int64
Months_on_book                int64
Total_Relationship_Count      int64
Months_Inactive_12_mon        int64
Contacts_Count_12_mon         int64
Credit_Limit                float64
Total_Revolving_Bal           int64
Avg_Open_To_Buy             float64
Total_Amt_Chng_Q4_Q1        float64
Total_Trans_Amt               int64
Total_Trans_Ct                int64
Total_Ct_Chng_Q4_Q1         float64
Avg_Utilization_Ratio       float64
dtype: object

Multicollinearity

Multicollinearity is the presence of independent variables that are interrelated to a high degree, making it difficult to distinguish their individual effects on the target variable.

In our case, we want to identify which of our features contribute to multicollinearity so that we can make it easier for our ML model coming in later to distinguish between these factors in predicting Attrition_Flag.

From the initial analysis, it does seem like the exact relation between attrition and most major features is hard to pinpoint. One commonly used way to measure which variables are contributing most to multicollinearity is VIF (Variance Inflation Factor).

VIF for feature i is given by the formula:

VIF_i = 1 / (1 - R_i^2)

where R_i^2 is the coefficient of determination from a linear regression of feature i against all of the other features. Since it intrinsically uses the same idea as linear regression, but applied between each variable and all the other variables, VIF is a good way to detect collinearity amongst variables. A VIF value much larger than 10 is best avoided.
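
To make the formula concrete, here is a minimal NumPy sketch of that computation on a small synthetic feature matrix (the helper function and the toy data are illustrative, not part of the tutorial's dataset):

```python
import numpy as np

# VIF for feature i: regress column i on the remaining columns via
# ordinary least squares, then apply VIF_i = 1 / (1 - R_i^2).
def vif(X, i):
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(others)), others])  # intercept term
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r_squared = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r_squared)

# Synthetic example: x2 is nearly a copy of x0, so both should show
# inflated VIFs, while the independent x1 should stay near 1.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = rng.normal(size=500)
x2 = x0 + rng.normal(scale=0.1, size=500)
X = np.column_stack([x0, x1, x2])
print([round(vif(X, i), 2) for i in range(3)])
```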

In [20]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_independent_vars = df.iloc[:, 1:18] # only examining the features

vif_data = pd.DataFrame()
vif_data["Feature"] = vif_independent_vars.columns
vif_data["VIF"] = [variance_inflation_factor(vif_independent_vars.values, i) for i in range(len(vif_independent_vars.columns))]
vif_data
/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
/usr/local/lib/python3.7/dist-packages/statsmodels/stats/outliers_influence.py:185: RuntimeWarning: divide by zero encountered in double_scalars
  vif = 1. / (1. - r_squared_i)
Out[20]:
Feature VIF
0 Customer_Age 84.666054
1 Gender 18.885704
2 Dependent_count 4.356983
3 Education_Level 5.594990
4 Marital_Status 5.831999
5 Income_Category 10.051182
6 Card_Category 15.296336
7 Months_on_book 56.933513
8 Total_Relationship_Count 7.766398
9 Months_Inactive_12_mon 6.379486
10 Contacts_Count_12_mon 5.739126
11 Credit_Limit inf
12 Total_Revolving_Bal inf
13 Avg_Open_To_Buy inf
14 Total_Amt_Chng_Q4_Q1 11.654772
15 Total_Trans_Amt 8.572890
16 Total_Trans_Ct 24.196200

There are some variables (Credit_Limit, Total_Revolving_Bal, and Avg_Open_To_Buy) with inf VIF values, meaning they are perfectly correlated with some combination of the other variables, reducing the so-called "independence" in our independent variable set (or feature set). Customer_Age and Months_on_book likewise have very high VIF values.
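
There's a plausible explanation for those inf values: in the rows shown by df.head() earlier, Avg_Open_To_Buy appears to equal Credit_Limit minus Total_Revolving_Bal exactly, i.e. a perfect linear dependence. A small sketch using just those five displayed rows (hard-coded here, so this isn't a check of the full dataset):

```python
import pandas as pd

# Values copied from the five rows of df.head() shown above.
sample = pd.DataFrame({
    "Credit_Limit":        [12691.0, 8256.0, 3418.0, 3313.0, 4716.0],
    "Total_Revolving_Bal": [777, 864, 0, 2517, 0],
    "Avg_Open_To_Buy":     [11914.0, 7392.0, 3418.0, 796.0, 4716.0],
})

# Does Credit_Limit - Total_Revolving_Bal reproduce Avg_Open_To_Buy?
diff = sample["Credit_Limit"] - sample["Total_Revolving_Bal"]
print((diff == sample["Avg_Open_To_Buy"]).all())  # prints True
```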

We could go into a feature-by-feature analysis of p-values or ANOVA to truly pick apart the underlying influences, but that would get rather tedious given how many features we have, especially when part of this project's goal is to let a machine learning model decide what's important!

That isn't to say that we should ignore those interrelated features, however. There's another big-picture way to quickly understand which features are correlated with which others. That big-picture way (literally, the figure is quite large) is Seaborn's heatmap, which shows us one-on-one interactions between variables.

In [21]:
plt.figure(figsize=(17, 15))
sns.heatmap(data=df.corr(), annot=True)
plt.ylabel("Feature 1")
plt.xlabel("Feature 2")
plt.title("Heatmap of Features")
Out[21]:
Text(0.5, 1.0, 'Heatmap of Features')

As we can see, the same variables from the VIF analysis that we noticed were very correlated with other variables are indeed reflected in the heatmap, but now broken down into one-on-one interactions. Below is a list of the major "red flags" for us (so to speak) and the other variables they show correlations with:

  • Customer_Age, Months_on_book
  • Credit_Limit, Gender, Income_Category, Card_Category, Avg_Open_To_Buy, and Avg_Utilization_Ratio
  • Total_Revolving_Bal, Avg_Utilization_Ratio
  • Total_Trans_Ct, Total_Trans_Amt

Logically, there do seem to be links between these variables, as previously alluded to. For example, like we saw in the initial exploration we did, the older a customer is, the more likely they'll have an open account with the same bank for longer. Another example is that the total transaction amount goes up with the total transaction count.

However, having multiple such variables could pose a problem in training our machine learning model. We'll try to either get rid of them or convert them into more useful values in the Feature Engineering part of 5. Machine Learning.

5. Machine Learning

With all the analysis we've done so far, we've been able to derive some degree of statistics-based human judgement as to how the variables we have interact with each other.

The goal for us now is to create a machine learning model to learn to take in several input features about a customer and predict whether that customer is likely to attrite or not.

Creating just one model for the job sounds a little hit-or-miss, so we'll try a couple of different ones until we reach a satisfactory performance.

Feature Engineering

Before that, however, let's act on the insights we've gained from 4. Exploratory Data Analysis. We've made some observations in our initial basic exploration, and we've noticed some features contribute to multicollinearity. Let's use that understanding to transform our feature set before feeding it into any ML model.

Here's a little checklist of things we can modify about our data:

  1. Drop Months_on_book. Reason: peculiar distribution of data as well as heavy contribution to multicollinearity.
  2. Drop Avg_Utilization_Ratio. Reason: heavy contribution to multicollinearity with several variables, and not much correlation with Attrition_Flag.
  3. Drop Avg_Open_To_Buy. Reason: heavy contribution to multicollinearity with several variables, and not much correlation with Attrition_Flag.
  4. Drop Gender. Reason: some contribution to multicollinearity without much impact on Attrition_Flag.
  5. Drop Credit_Limit. Reason: heavy contribution to multicollinearity without much impact on Attrition_Flag.
  6. Drop Customer_Age. Reason: some contribution to multicollinearity without much impact on Attrition_Flag.
  7. Divide Total_Trans_Amt by Total_Trans_Ct to get Amt_Per_Trans, and then drop Total_Trans_Amt and Total_Trans_Ct. Reason: an average amount per transaction is a better indicator of the customer's interest in continuing to use their credit card than two aggregate values which each contribute to multicollinearity.


In [22]:
df["Amt_Per_Trans"] = df["Total_Trans_Amt"]/df["Total_Trans_Ct"]
df.drop(columns=["Total_Trans_Ct", "Total_Trans_Amt", "Months_on_book", "Avg_Utilization_Ratio", "Customer_Age", "Avg_Open_To_Buy", "Gender", "Credit_Limit"], inplace=True)
df.head()
Out[22]:
Attrition_Flag Dependent_count Education_Level Marital_Status Income_Category Card_Category Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Total_Revolving_Bal Total_Amt_Chng_Q4_Q1 Total_Ct_Chng_Q4_Q1 Amt_Per_Trans
0 0 3 2 3 3 1 5 1 3 777 1.335 1.625 27.238095
1 0 5 4 1 1 1 6 1 2 864 1.541 3.714 39.121212
2 0 3 4 3 4 1 4 1 0 0 2.594 2.333 94.350000
4 0 3 1 3 3 1 5 1 0 0 2.175 2.500 29.142857
5 0 2 4 3 2 1 3 1 2 1247 1.376 0.846 45.333333
In [23]:
df.shape
Out[23]:
(7081, 13)

As a certain legendary CMSC320 professor once said to his Spring 2021 class, it's important to allow the next stage of the data science pipeline to inform the previous stage, so that we optimize our approach over time. In light of that, let's check whether we've combatted the issue of multicollinearity in our variables.

In [24]:
vif_independent_vars = df.iloc[:, 1:13]
vif_data = pd.DataFrame()
vif_data["Feature"] = vif_independent_vars.columns
vif_data["VIF"] = [variance_inflation_factor(vif_independent_vars.values, i) for i in range(len(vif_independent_vars.columns))]
vif_data
Out[24]:
Feature VIF
0 Dependent_count 4.120050
1 Education_Level 5.323690
2 Marital_Status 5.485633
3 Income_Category 3.975358
4 Card_Category 10.960939
5 Total_Relationship_Count 7.022855
6 Months_Inactive_12_mon 5.922546
7 Contacts_Count_12_mon 5.493939
8 Total_Revolving_Bal 3.066947
9 Total_Amt_Chng_Q4_Q1 14.114600
10 Total_Ct_Chng_Q4_Q1 11.282850
11 Amt_Per_Trans 6.875759

The VIF values are looking much better, and much closer to our desired range! It's time to finally move onto the next step.

Data Split and Scaling

In ML, the data typically undergoes two splits, resulting in a four-way division: the features-label split and the training-testing split.

The features-label split allows us to explicitly distinguish the target variable, or y (in our case, the label, which is Attrition_Flag), from the remaining independent variables, or x (everything we've been left with after our feature engineering step).

Let's shuffle the dataset for good measure before proceeding with splitting.

In [25]:
df = df.sample(frac=1).reset_index(drop=True) # shuffling dataset

Below, df_y will store the label as a Pandas DataFrame, whereas y will store it as a NumPy ndarray.

In [26]:
df_y = df[["Attrition_Flag"]]
y = df_y.to_numpy()
y
Out[26]:
array([[0],
       [0],
       [1],
       ...,
       [0],
       [0],
       [1]])

Now, because the values in each column of our feature set vary so widely between their minimum and maximum values, it's a good idea to apply some sort of scaling (or normalization, in a sense) to the data so that the machine learning models can converge more efficiently. We'll apply scikit-learn's StandardScaler, which centers each column on its mean and divides it by its standard deviation, putting every feature on a comparable scale. (Note that standardization does not suppress the influence of outliers; scikit-learn offers RobustScaler when that's a concern.)

As we did with the label, we'll store the features as a Pandas DataFrame in df_x, and as a NumPy ndarray (after scaling) in x.

In [27]:
df_x = df.iloc[:, 1:13]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x = scaler.fit_transform(df_x)
x
Out[27]:
array([[-0.26154879, -0.75865789,  0.95600102, ...,  1.61084098,
        -0.79399471, -0.64952828],
       [ 0.51270999,  0.66496912,  0.95600102, ...,  1.11784026,
        -1.10822812,  0.49437013],
       [ 0.51270999,  0.66496912,  0.95600102, ..., -1.70122751,
        -1.43922063, -0.15305565],
       ...,
       [ 1.28696877, -1.47047139, -1.13818095, ..., -0.67488964,
        -0.34149862, -0.38668583],
       [ 1.28696877,  1.37678262, -0.09108997, ..., -1.49954539,
        -0.15295858,  0.27474723],
       [ 1.28696877,  0.66496912, -0.09108997, ..., -0.57628949,
        -1.83724961,  0.8792384 ]])

And now, the second two-way split of the data: the training-testing split. This is important because we want to distinguish the seen data from the unseen data. This means that the machine learning models will attempt to "converge", or reach an "understanding" (optimal configuration of parameters), over the seen data, or the training set.

In essence, they use the training set to try to learn what makes a customer attrite.

For us to gauge how well they have trained, we can assess their performance on the unseen data, or testing set, and make use of some common performance metrics to evaluate them.

We'll go with the common training:testing set ratio of 80:20.

We can go ahead and split x into x_train and x_test and y into y_train and y_test.

In [28]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
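One refinement worth flagging: above, the scaler was fit on the full dataset before splitting, which lets test-set statistics leak into our preprocessing, and the split was not stratified, so the two halves may carry slightly different churn rates. A leakage-safe sketch of the same steps (shown on synthetic stand-in data, not the notebook's actual run) would fit the scaler on the training split only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data: ~16% positives, like Attrition_Flag.
X, y = make_classification(n_samples=1000, weights=[0.84], random_state=0)

# stratify=y keeps the churn rate roughly equal across both splits.
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# Fit on training data ONLY, then apply the same transform to the test set,
# so no test-set statistics leak into preprocessing.
scaler = StandardScaler().fit(x_tr)
x_tr_s, x_te_s = scaler.transform(x_tr), scaler.transform(x_te)
```

For the analysis in this tutorial the difference is small, but in production pipelines this ordering matters.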

Now that we've prepared our data to be trained and tested on, let's get our first ML model on the job!

Hyperparameter Optimization and Performance Metrics

In the machine learning models that are about to follow, you will notice that they are not running entirely on stock/default settings. At the time of creating this notebook, I conducted some basic hyperparameter tweaking to produce reasonably good performances from each model.

The performance metrics we'll rely on in assessing training and testing performance will be accuracy, precision, recall, and AUC-ROC. Out of all of these metrics, our primary focus will be on maximizing the following two to a reasonable extent:

  1. Recall - due to the high impact of false negatives in our case. A false negative is an attrited customer assumed to stay (truth: 1, guess: 0), which is worse than a false positive (truth: 0, guess: 1).
  2. AUC-ROC - due to how good a measure it is of the model's ability to tell positive and negative cases apart.
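To make the recall-first reasoning concrete, here is a toy computation (with made-up predictions) showing how recall punishes exactly the false negatives we care about:

```python
# Hypothetical ground truth and predictions: 4 attrited customers (1s),
# of which the model only catches one.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

recall = tp / (tp + fn)              # 0.25 -- three churners slipped through
precision = tp / (tp + fp)           # 0.5
accuracy = (tp + tn) / len(y_true)   # 0.5
```

A model can post a respectable accuracy while its recall exposes that most attriting customers were missed, which is exactly the failure mode we want to avoid here.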

Here is a quick visual summary of performance metrics:

[Image: confusion matrix and the evaluation metrics derived from it]

In [29]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import roc_auc_score

Model 1: Logistic Regression

This seems like a good place to start with this problem. After all, a logistic regression model weights multiple inputs and compresses them into a prediction between 0 and 1. Let's see how it does.
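As a minimal sketch of that "weight and compress" idea (with made-up weights and inputs, not values learned from our data), the model pushes a weighted sum of the features through the sigmoid, which squashes any real number into (0, 1):

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -1.2, 0.5])        # hypothetical learned weights
b = -0.3                              # hypothetical intercept
x_example = np.array([1.0, 0.5, 2.0])  # hypothetical (scaled) feature vector

# Linear combination of the features, then the sigmoid squash:
p = sigmoid(w @ x_example + b)        # interpretable as P(attrition), ~0.71 here
```

Training just means finding the w and b that make these probabilities match the observed labels as closely as possible.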

Training

In [30]:
from sklearn.linear_model import LogisticRegression

model_1 = LogisticRegression(C=500, max_iter=2000)
model_1.fit(x_train, y_train)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Out[30]:
LogisticRegression(C=500, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=2000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [31]:
train_predictions = model_1.predict(x_train)
In [32]:
cm_train = confusion_matrix(y_train, train_predictions)
sns.heatmap(cm_train, annot=True, fmt="d", cmap='Blues')
plt.ylabel("Ground Truth")
plt.xlabel("Predicted")
plt.title("Training Confusion Matrix")
Out[32]:
Text(0.5, 1.0, 'Training Confusion Matrix')
In [33]:
print("Training Accuracy: ", accuracy_score(y_train, train_predictions))
print("Training Precision:", precision_score(y_train, train_predictions))
print("Training Recall:   ", recall_score(y_train, train_predictions))
print("Training AUC-ROC:  ", roc_auc_score(y_train, train_predictions))
Training Accuracy:  0.8824152542372882
Training Precision: 0.7558139534883721
Training Recall:    0.36681715575620766
Training AUC-ROC:   0.6724207168483842

Testing

In [34]:
test_predictions = model_1.predict(x_test)
In [35]:
cm_test = confusion_matrix(y_test, test_predictions)
sns.heatmap(cm_test, annot=True, fmt="d", cmap='Blues')
plt.ylabel("Ground Truth")
plt.xlabel("Predicted")
plt.title("Testing Confusion Matrix")
Out[35]:
Text(0.5, 1.0, 'Testing Confusion Matrix')
In [36]:
print("Testing Accuracy:  ", accuracy_score(y_test, test_predictions))
print("Testing Precision: ", precision_score(y_test, test_predictions))
print("Testing Recall:    ", recall_score(y_test, test_predictions))
print("Testing AUC-ROC:   ", roc_auc_score(y_test, test_predictions))
Testing Accuracy:   0.876499647141849
Testing Precision:  0.7241379310344828
Testing Recall:     0.3700440528634361
Testing AUC-ROC:    0.6715766482804575

Our logistic regression model did not do so well on this task. We have lots of false negatives, as our low recall score and our confusion matrix show us.
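One plausible remedy for the flood of false negatives, which I did not try in the original run and sketch here on synthetic stand-in data, is class weighting: making the minority (attrited) class cost more to misclassify typically trades some precision for recall.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic stand-in with ~16% positives, roughly matching our churn rate.
X, y = make_classification(n_samples=2000, weights=[0.84], random_state=0)

plain = LogisticRegression(max_iter=2000).fit(X, y)
weighted = LogisticRegression(max_iter=2000, class_weight="balanced").fit(X, y)

# "balanced" reweights each class inversely to its frequency, so missing an
# attrited customer hurts the loss more than missing a staying one.
recall_plain = recall_score(y, plain.predict(X))
recall_weighted = recall_score(y, weighted.predict(X))
```

Whether the recall gain is worth the precision it costs depends on how expensive retention incentives are relative to losing a customer.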

Model 2: Random Forest

Let's see if our luck goes up with a random forest classifier! After all, it's a meta estimator (averaging the results of individual decision trees). Hopefully, it'll perform better than the singular logistic regression model we used.
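The "meta estimator" idea boils down to voting, which this toy sketch captures (scikit-learn actually averages the trees' predicted probabilities, which amounts to a soft vote):

```python
# Hypothetical per-tree predictions for one customer: each decision tree
# casts a 0/1 vote, and the forest goes with the majority.
votes = [1, 0, 1, 1, 0]
forest_prediction = int(sum(votes) / len(votes) >= 0.5)  # majority says 1
```

Because each tree trains on a different bootstrap sample of the data, their individual errors tend to cancel out in the vote, which is where the ensemble's robustness comes from.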

Training

In [37]:
from sklearn.ensemble import RandomForestClassifier

model_2 = RandomForestClassifier(max_depth=40)
model_2.fit(x_train, y_train)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:4: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  after removing the cwd from sys.path.
Out[37]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=40, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
In [38]:
train_predictions_2 = model_2.predict(x_train)
In [39]:
cm_train_2 = confusion_matrix(y_train, train_predictions_2)
sns.heatmap(cm_train_2, annot=True, fmt="d", cmap='Blues')
plt.ylabel("Ground Truth")
plt.xlabel("Predicted")
plt.title("Training Confusion Matrix")
Out[39]:
Text(0.5, 1.0, 'Training Confusion Matrix')
In [40]:
print("Training Accuracy: ", accuracy_score(y_train, train_predictions_2))
print("Training Precision:", precision_score(y_train, train_predictions_2))
print("Training Recall:   ", recall_score(y_train, train_predictions_2))
print("Training AUC-ROC:  ", roc_auc_score(y_train, train_predictions_2))
Training Accuracy:  1.0
Training Precision: 1.0
Training Recall:    1.0
Training AUC-ROC:   1.0

Those values seem just a little suspicious. Perfect training scores are not uncommon for unconstrained random forests, but they do stink of overfitting, so the testing set will be the real judge.
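If we wanted to rein in that perfect training score, one standard lever (sketched here on synthetic data, not our churn set) is capping tree depth and leaf size, trading memorization for generalization:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with the same feature count as our engineered set.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

# Unconstrained trees tend to memorize the training data...
deep = RandomForestClassifier(random_state=0).fit(X, y)

# ...while shallow trees with larger leaves are forced to generalize.
shallow = RandomForestClassifier(max_depth=3, min_samples_leaf=10,
                                 random_state=0).fit(X, y)

train_acc_deep = deep.score(X, y)      # near-perfect on training data
train_acc_shallow = shallow.score(X, y)  # lower on training data, by design
```

The lower training accuracy of the constrained forest is the point: the gap between training and testing scores usually shrinks along with it.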

Testing

In [41]:
test_predictions_2 = model_2.predict(x_test)
In [42]:
cm_test_2 = confusion_matrix(y_test, test_predictions_2)
sns.heatmap(cm_test_2, annot=True, fmt="d", cmap='Blues')
plt.ylabel("Ground Truth")
plt.xlabel("Predicted")
plt.title("Testing Confusion Matrix")
Out[42]:
Text(0.5, 1.0, 'Testing Confusion Matrix')
In [43]:
print("Testing Accuracy:  ", accuracy_score(y_test, test_predictions_2))
print("Testing Precision: ", precision_score(y_test, test_predictions_2))
print("Testing Recall:    ", recall_score(y_test, test_predictions_2))
print("Testing AUC-ROC:   ", roc_auc_score(y_test, test_predictions_2))
Testing Accuracy:   0.9103740296400847
Testing Precision:  0.852112676056338
Testing Recall:     0.5330396475770925
Testing AUC-ROC:    0.7576962943767814

There we are, our recall finally crossed 0.5! This is a decent improvement on the logistic regression model. Our AUC-ROC score went up as well, so the random forest classifier seems to be better at distinguishing between labels.

Model 3: XGBoost

Naturally, we can't let a machine learning problem go by without giving extreme gradient boosting a shot at it. Hopefully, we can beat the random forest classifier's scores, and if we're successful, we can move onto explaining attrition using this model.
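Where the random forest averages independently grown trees, boosting builds them sequentially. As a loose intuition (a toy sketch, not XGBoost's actual algorithm), each round fits a tiny learner, here just a constant "stump", to the residuals left by the rounds before it:

```python
# Hypothetical regression targets, to keep the arithmetic visible.
values = [3.0, 5.0, 7.0]
prediction = [0.0, 0.0, 0.0]
learning_rate = 1.0

for _ in range(3):  # three boosting rounds
    # Each round sees what the ensemble so far still gets wrong...
    residuals = [v - p for v, p in zip(values, prediction)]
    # ...fits the best constant to those errors...
    stump = sum(residuals) / len(residuals)
    # ...and adds that correction to the running prediction.
    prediction = [p + learning_rate * stump for p in prediction]

# Round 1 learns the mean (5.0); later rounds see zero-mean residuals
# and leave the prediction unchanged.
```

XGBoost replaces the constant stump with gradient-fit decision trees and adds regularization, but the error-correcting loop is the same spirit.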

In [44]:
from xgboost import XGBClassifier

model_3 = XGBClassifier(use_label_encoder=False, learning_rate=1, n_estimators=70)
model_3.fit(x_train, y_train)
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py:235: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py:268: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Out[44]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=1,
              max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
              n_estimators=70, n_jobs=1, nthread=None,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
              subsample=1, use_label_encoder=False, verbosity=1)

Training

In [45]:
train_predictions_3 = model_3.predict(x_train)
In [46]:
from sklearn.metrics import confusion_matrix

cm_train_3 = confusion_matrix(y_train, train_predictions_3)
sns.heatmap(cm_train_3, annot=True, fmt="d", cmap='Blues')
plt.ylabel("Ground Truth")
plt.xlabel("Predicted")
plt.title("Training Confusion Matrix")
Out[46]:
Text(0.5, 1.0, 'Training Confusion Matrix')
In [47]:
print("Training Accuracy: ", accuracy_score(y_train, train_predictions_3))
print("Training Precision:", precision_score(y_train, train_predictions_3))
print("Training Recall:   ", recall_score(y_train, train_predictions_3))
print("Training AUC-ROC:  ", roc_auc_score(y_train, train_predictions_3))
Training Accuracy:  0.9692796610169492
Training Precision: 0.9483627204030227
Training Recall:    0.8498871331828443
Training AUC-ROC:   0.9206530684750555

Testing

In [48]:
test_predictions_3 = model_3.predict(x_test)
In [49]:
cm_test_3 = confusion_matrix(y_test, test_predictions_3)
sns.heatmap(cm_test_3, annot=True, fmt="d", cmap='Blues')
plt.ylabel("Ground Truth")
plt.xlabel("Predicted")
plt.title("Testing Confusion Matrix")
Out[49]:
Text(0.5, 1.0, 'Testing Confusion Matrix')
In [50]:
print("Testing Accuracy:  ", accuracy_score(y_test, test_predictions_3))
print("Testing Precision: ", precision_score(y_test, test_predictions_3))
print("Testing Recall:    ", recall_score(y_test, test_predictions_3))
print("Testing AUC-ROC:   ", roc_auc_score(y_test, test_predictions_3))
Testing Accuracy:   0.9040225829216655
Testing Precision:  0.7570621468926554
Testing Recall:     0.5903083700440529
Testing AUC-ROC:    0.7770869581312702

It looks like we definitely did beat the random forest classifier! Our recall is around 0.6, which is not as high as it could get with more tuning or a different model, but this is sufficient to move onto explaining our champion model's predictions.

6. Explainability and Insight

We've run three different ML models on the data in increasing order of test performance. Given time, we could play with and tweak these models forever. We could even delve into deep learning models, which would train for longer and probably perform better on test data. However, the goal of this project is not really optimal performance - rather, it is being able to explain the link between the underlying factors and attrition.

To preserve that purpose, we'll use a tool called LIME (Local Interpretable Model-agnostic Explanations) on our best-performing ML model, our XGBoost classifier.

Let's first install the package into the environment.

In [51]:
!pip install lime
Requirement already satisfied: lime in /usr/local/lib/python3.7/dist-packages (0.2.0.1)

We can instantiate an explainer object that takes in an individual data point and our chosen model's prediction for it, and then shows us which factors the model deemed most relevant in that case of attrition.

In [52]:
import lime 
import lime.lime_tabular

list_of_features = df_x.columns

explainer = lime.lime_tabular.LimeTabularExplainer(
    x_train, 
    mode="classification", 
    feature_names=list_of_features,
    )

Let's store a list of all the indices of our test set that had attrited customers (label 1) that our model was able to successfully identify, so that LIME can analyze why the model thought that customer left.

In other words, we want to find a few true positives out of our ML model's predictions to understand why each of them attrited.

As of now, LIME works best at explaining individual data points, which is why we'll run it on only a small set of true positives out of the whole list we're collecting below.

In [53]:
true_positive_indices = []
for i in range(len(y_test)):
  if y_test[i] == 1 and test_predictions_3[i] == 1:  # reuse predictions from In [48]
    true_positive_indices.append(i)

For demonstration, let's run our LIME explainer on the first three correctly identified cases of attrition on our list of true_positive_indices.

In [54]:
idx = true_positive_indices[0]
print("Prediction:", model_3.predict(x_test[idx].reshape(1,-1)))
print("Actual:    ", y_test[idx])

explanation = explainer.explain_instance(
    x_test[idx], 
    model_3.predict_proba, 
    num_features=len(list_of_features))

explanation.show_in_notebook()
Prediction: [1]
Actual:     [1]
In [55]:
idx = true_positive_indices[1]
print("Prediction:", model_3.predict(x_test[idx].reshape(1,-1)))
print("Actual:    ", y_test[idx])

explanation = explainer.explain_instance(
    x_test[idx], 
    model_3.predict_proba, 
    num_features=len(list_of_features))

explanation.show_in_notebook()
Prediction: [1]
Actual:     [1]
In [56]:
idx = true_positive_indices[2]
print("Prediction:", model_3.predict(x_test[idx].reshape(1,-1)))
print("Actual:    ", y_test[idx])

explanation = explainer.explain_instance(
    x_test[idx], 
    model_3.predict_proba, 
    num_features=len(list_of_features))

explanation.show_in_notebook()
Prediction: [1]
Actual:     [1]

Interpreting LIME Results

A note on how to see what LIME is telling us:

  1. The panel on the left, titled "Prediction probabilities", tells us the amount of confidence with which our model guessed that particular case of attrition.
  2. The panel in the middle is a display of features in decreasing order of influence on the prediction. Values towards the right (orange) positively contributed, and those to the left (blue) negatively contributed.
  3. The panel on the right is a table of the feature values for that particular instance. However, since we scaled the data with StandardScaler in the Data Split and Scaling part of 5. Machine Learning, the values shown are standardized, not the raw values from the dataset.
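If the raw units ever matter for interpretation, StandardScaler keeps enough information to undo the transform. A minimal sketch on toy data (in the notebook itself, you would call inverse_transform on the fitted scaler from earlier):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy two-feature data standing in for our scaled feature matrix.
raw = np.array([[10.0, 200.0],
                [20.0, 400.0],
                [30.0, 600.0]])

scaler = StandardScaler().fit(raw)
scaled = scaler.transform(raw)              # what the model (and LIME) sees
recovered = scaler.inverse_transform(scaled)  # back to the original units
```

This lets us report a customer's actual Total_Revolving_Bal, for instance, alongside LIME's standardized view.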

Note: due to the way that ML models train, and due to the shuffling of the dataset, these results will slightly vary every time this notebook is run. I did not implement a random seed to preserve the real-time operation of this code, should someone want to download this notebook and run/tweak it themselves.

Insights Drawn

From all the analysis we've done, including studying the correlations to Attrition_Flag on the heatmap and interpreting what LIME revealed about which features our XGBoost model found influential, the following stand out as among the most important:

  • Total_Revolving_Bal - the lower it is, the greater the risk of attrition
  • Total_Amt_Chng_Q4_Q1 - the lower it is, the greater the risk of attrition
  • Months_Inactive_12_mon - the higher it is, the greater the risk of attrition
  • Contacts_Count_12_mon - the higher it is, the more dissatisfied a customer seems to be
  • Amt_Per_Trans - the higher it is, the smaller the chance of attrition

All of these are factors that make sense, and being able to act on incentivizing credit card customers accordingly and maintaining a good relationship with them can help the bank minimize its attrition.

And not only have we identified key factors that result in attrition, but we also have an explainable way to predict if someone is likely to attrite, given the data we need about them. This enhances the real-world usability of a pipeline like this.

7. Conclusion and Resources

This isn't the shortest data science tutorial, as far as they go. However, the problem we tackled was an actual business need in terms of not only having a solution that could guess which customers were likely to leave, but also knowing what factors influenced that customer's decision, according to our explainability framework.

We've gone over a wide range of concepts and tools: organizing our variables, examining their distributions, identifying interdependence within the "independent" set of feature variables, pinpointing causes of multicollinearity, cleaning those variables up in feature engineering, running three machine learning models on them, and finally giving our champion model the ability to explain its predictions.

As we can imagine, the usability of such a solution in the real world (albeit hopefully with a slightly better-performing model) is rooted not only in predictive capability, but in trust: not just the business user's trust in what was once a black-box algorithm, but the broader idea that a human can trust a self-teaching intelligent solution to explain itself and its decisions.

With a more detailed field-specific understanding (knowing the applied meaning of the features we had), a more extensive search for the right model, and a more rigorous hyperparameter tuning process, this data science pipeline could be improved. But hopefully, this tutorial has given you some insight as to what goes into code that decodes.

Thank you for taking the time to go through this tutorial, and below are some links that you can use to explore the tools and ideas we used even further.

Since this notebook was created in Google Colab, the code cell below is simply to help me download it as a .html file to submit as my CMSC320 final project for Spring 2021 at UMD.

In [59]:
!jupyter nbconvert --to html /content/CMSC320_Final_BankChurners.ipynb
[NbConvertApp] Converting notebook /content/CMSC320_Final_BankChurners.ipynb to html
[NbConvertApp] Writing 4703920 bytes to /content/CMSC320_Final_BankChurners.html